Python dataframes with pandas and polars

Andreas Beger and Isaac Chung
PyData Tallinn x Python CodeClub
27 November 2024

Bios

Andreas Beger

  • 🏢 Data Scientist, Consult.
  • 🏃‍♂️🐌 Slow marathoner
  • 📍 🇩🇪/🇭🇷 → 🇺🇸 → 🇪🇪
  • 🎓 PhD Political Science

Isaac Chung

  • 🏢 Staff Data Scientist, Wrike
  • 🏊‍♂️🚴🏃‍♂️ Fast triathlete
  • 📍 🇭🇰 → 🇨🇦 → 🇪🇪
  • 🎓 MS Machine Learning

🐍 We are also the PyData Tallinn co-organizers.

Agenda

  1. Prelude: setting up, what are dataframes?
  2. (notebook) pandas and basic dataframe concepts and operations
  3. (notebook) polars: revisit the basics and cover more advanced operations
  4. The bigger picture: pandas vs polars, other frameworks

Using GitHub Codespaces

Open dataframes.ipynb

To just follow along

While we wait

  • Who has used pandas before?

  • polars?

  • Another data framework in Python, e.g. database + SQL?

  • Does code like this mean anything to you?

    titanic %>% 
      select(Pclass, Survived) %>% 
      group_by(Pclass) %>% 
      summarize(passengers = n(), surv_rate = mean(Survived))

What are dataframes?

Definition

  • Dataframes are a data type representing 2D tables
  • Where the columns have names
  • Unlike matrices or arrays, columns might have different data types
  • And the rows are identified by one or more ID variables
x  y  group
1  2  a
4  7  b
3  8  a
9  2  b

Why?

Imagine working with tabular data if we didn’t have dataframes and associated methods.

by_rows = [
    {"x": 1, "y": 2, "group": "a"},
    {"x": 4, "y": 7, "group": "b"},
    {"x": 3, "y": 8, "group": "a"},
    {"x": 9, "y": 2, "group": "b"}
]


by_columns = {
    "x": [1, 4, 3, 9],
    "y": [2, 7, 8, 2],
    "group": ["a", "b", "a", "b"]
}
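For instance (a sketch in plain Python, using the toy rows above): even a simple "mean of y per group" takes manual bookkeeping without a dataframe.

```python
# Without dataframes, "mean of y per group" means looping,
# accumulating per-group totals and counts, and dividing by hand.
by_rows = [
    {"x": 1, "y": 2, "group": "a"},
    {"x": 4, "y": 7, "group": "b"},
    {"x": 3, "y": 8, "group": "a"},
    {"x": 9, "y": 2, "group": "b"},
]

totals = {}
counts = {}
for row in by_rows:
    g = row["group"]
    totals[g] = totals.get(g, 0) + row["y"]
    counts[g] = counts.get(g, 0) + 1

mean_y = {g: totals[g] / counts[g] for g in totals}
print(mean_y)  # {'a': 5.0, 'b': 4.5}
```

With a dataframe this collapses to one line, e.g. `df.groupby("group")["y"].mean()` in pandas.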

Section 1: pandas

History

  • Created by Wes McKinney, now at Posit PBC
  • Started in 2008
  • Originally built on top of numpy

Getting started

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "quarter": [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    "x": np.random.randn(12),
    "date": pd.date_range("2024-01-01", periods=12, freq="MS")
})

df.head()
   quarter         x       date
0        1 -1.125680 2024-01-01
1        1 -0.112682 2024-02-01
2        1  0.259902 2024-03-01
3        2  0.426993 2024-04-01
4        2  0.669239 2024-05-01

Components of a dataframe

Series

df.x
0    -1.125680
1    -0.112682
2     0.259902
3     0.426993
4     0.669239
5     0.429788
6     1.998504
7     0.445176
8     1.592101
9     0.135922
10    0.115432
11   -0.824800
Name: x, dtype: float64

Columns

df.columns
Index(['quarter', 'x', 'date'], dtype='object')

Index

df.index
RangeIndex(start=0, stop=12, step=1)

Let’s look at some real data

Estonian vehicle accident data

  • 📖 ✍️ read and write
  • 🔬 inspect

Input - reading data

accidents = pd.read_csv("data/estonia-traffic-accidents-clean.csv")

Inspecting

accidents.shape
(14259, 8)
accidents.columns
Index(['date', 'persons_involved', 'killed', 'injured', 'county',
       'pedestrian_involved', 'accident_type', 'light_conditions'],
      dtype='object')
accidents.head()
                  date  persons_involved  killed  injured         county
0  2014-10-24 08:45:00                 2       0        1  Harju maakond
1  2014-10-24 13:45:00                 2       0        1  Harju maakond
2  2014-08-11 00:00:00                 2       0        1  Harju maakond
3  2014-11-17 17:32:00                 2       0        2  Harju maakond
4  2015-04-28 07:55:00                 2       0        1  Harju maakond

Inspecting

accidents.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14259 entries, 0 to 14258
Data columns (total 8 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   date                 14259 non-null  object
 1   persons_involved     14259 non-null  int64 
 2   killed               14259 non-null  int64 
 3   injured              14259 non-null  int64 
 4   county               14259 non-null  object
 5   pedestrian_involved  14259 non-null  int64 
 6   accident_type        14259 non-null  object
 7   light_conditions     14259 non-null  object
dtypes: int64(4), object(4)
memory usage: 891.3+ KB

What time period does the data cover?

  • 🛒 select columns
  • 🥪 mutate, add columns

Selecting columns

There are different ways; one is indexing with []:

accidents["date"].head(4)
0    2014-10-24 08:45:00
1    2014-10-24 13:45:00
2    2014-08-11 00:00:00
3    2014-11-17 17:32:00
Name: date, dtype: object

Multiple columns

accidents[["date", "county"]].head(4)
                  date         county
0  2014-10-24 08:45:00  Harju maakond
1  2014-10-24 13:45:00  Harju maakond
2  2014-08-11 00:00:00  Harju maakond
3  2014-11-17 17:32:00  Harju maakond

Mutating columns

Right now date is stored as a string:

accidents["date"][0]
'2014-10-24 08:45:00'
type(accidents["date"][0])
str


Convert it to a proper data type:

accidents["date"] = pd.to_datetime(accidents["date"])
type(accidents["date"][0])
pandas._libs.tslibs.timestamps.Timestamp

Pandas Series

dates = accidents["date"]
type(dates)
pandas.core.series.Series


start = accidents["date"].min()
end = accidents["date"].max()
print(f"First accident: {start}\nLast accident: {end}")
First accident: 2011-01-05 00:00:00
Last accident: 2021-12-31 23:45:00


accidents["accident_type"].value_counts()
accident_type
Kokkupõrge            5605
Ühesõidukiõnnetus     3946
Jalakäijaõnnetus      3386
Muu liiklusõnnetus    1262
Teadmata                60
Name: count, dtype: int64

How many accidents were in Harju county?

  • 🔍 filter rows

Filtering rows

accidents[accidents["county"] == "Harju maakond"].shape
(7000, 8)


accidents["county"] == "Harju maakond"
0         True
1         True
2         True
3         True
4         True
         ...  
14254    False
14255    False
14256     True
14257    False
14258    False
Name: county, Length: 14259, dtype: bool

How many people were harmed in total?

  • 🥪 mutate, add columns

Mutating dataframes

accidents["killed_or_injured"] = accidents["killed"] + accidents["injured"]
accidents[['killed', 'injured', 'killed_or_injured']].head()
   killed  injured  killed_or_injured
0       0        1                  1
1       0        1                  1
2       0        1                  1
3       0        2                  2
4       0        1                  1


accidents["killed_or_injured"].sum()
np.int64(18021)
sum(accidents["killed_or_injured"])
18021

What about by accident type?

  • 👨‍👩‍👧‍👦 group and aggregate

Grouping and summarizing

by_type = accidents.groupby("accident_type").agg({"killed_or_injured": "sum"})
by_type
                    killed_or_injured
accident_type
Jalakäijaõnnetus                 3548
Kokkupõrge                       7951
Muu liiklusõnnetus               1436
Teadmata                           70
Ühesõidukiõnnetus                5016

(Optional) More on indices

See notebook.
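The notebook has the details; as a minimal taste (toy numbers, not the accidents data): a pandas index lets you look rows up by label instead of by position.

```python
import pandas as pd

df = pd.DataFrame({"county": ["Harju", "Tartu"], "accidents": [7000, 1591]})
df = df.set_index("county")           # labels replace the default RangeIndex
harju = df.loc["Harju", "accidents"]  # label-based lookup via the index
df = df.reset_index()                 # back to a plain column
```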

(Optional) Cleaning the accidents data

See notebook.

pandas is great


2017, Wes McKinney (creator of pandas):

10 Things I Hate About Pandas

  • Inefficient memory management, need 5-10x data size
  • Eager evaluation → limited query planning
  • No multi-core
  • Apache Arrow (2016 - …)

Section 2: polars

History

  • Created in 2020 by Ritchie Vink
  • Structural engineer gone data scientist/engineer
  • Written in Rust
  • Uses Arrow as internal representation

Getting started

import polars as pl

accidents = pl.read_csv("data/estonia-traffic-accidents-clean.csv")
accidents.head()
shape: (5, 5)
date                   persons_involved  killed  injured  county
str                    i64               i64     i64      str
"2014-10-24 08:45:00"  2                 0       1        "Harju maakond"
"2014-10-24 13:45:00"  2                 0       1        "Harju maakond"
"2014-08-11 00:00:00"  2                 0       1        "Harju maakond"
"2014-11-17 17:32:00"  2                 0       2        "Harju maakond"
"2015-04-28 07:55:00"  2                 0       1        "Harju maakond"

polars is different from pandas

But, we can always convert back and forth:

import pyarrow

df = pl.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]})
# to make this a pandas dataframe
# (requires pyarrow)
df_pd = df.to_pandas()
# to convert it back to polars dataframe
df_pl = pl.DataFrame(df_pd)

How many were harmed in the biggest accident?

  • 🛒 select columns
  • 🥪 mutate, add columns
  • 🔍 filter rows

Expressions!

Selecting columns

accidents.select("date", "county").head()
shape: (5, 2)
date                   county
str                    str
"2014-10-24 08:45:00"  "Harju maakond"
"2014-10-24 13:45:00"  "Harju maakond"
"2014-08-11 00:00:00"  "Harju maakond"
"2014-11-17 17:32:00"  "Harju maakond"
"2015-04-28 07:55:00"  "Harju maakond"

Expressions

Expressions are abstract, composable data transformations that are executed with a context that provides data.

accidents.select(pl.col("date")).head(1)
shape: (1, 1)
date
str
"2014-10-24 08:45:00"
foo = pl.col("date")
foo
col("date")
accidents.select(foo).head(1)
shape: (1, 1)
date
str
"2014-10-24 08:45:00"

They can be composed

How many people were harmed in the biggest accident in our data?

accidents.select(
    # select 'killed'
    pl.col("killed")
    # add 'injured'
    .add(pl.col("injured"))
    # give the result a new column name
    .alias("killed_or_injured")
    # identify the max value
    .max())
shape: (1, 1)
killed_or_injured
i64
23

Expressions work in multiple contexts

  • select()
  • filter()
  • with_columns(): mutating dataframes
  • group_by() and aggregations

How many accidents were in Harju county?

  • 🔍 filter rows

Filtering rows

accidents.filter(pl.col("county").eq("Harju maakond")).shape
(7000, 8)
accidents.filter(pl.col("county")=="Harju maakond").shape
(7000, 8)


Let’s be slightly lazy:

accidents.filter(pl.col("county").str.contains("Harju")).shape
(7000, 8)

What fraction of people involved in an accident were harmed?

  • 🥪 mutate, add columns

Mutating dataframes

with_columns() + expressions

accidents = accidents.with_columns(
    pl.col("killed").add(pl.col("injured")).alias("killed_or_injured"),
    pl.col("killed").add(pl.col("injured")).truediv(pl.col("persons_involved")).alias("harmed_rate")
)
accidents.select(["date", "persons_involved", "killed_or_injured", "harmed_rate"]).head(5)
shape: (5, 4)
date                   persons_involved  killed_or_injured  harmed_rate
str                    i64               i64                f64
"2014-10-24 08:45:00"  2                 1                  0.5
"2014-10-24 13:45:00"  2                 1                  0.5
"2014-08-11 00:00:00"  2                 1                  0.5
"2014-11-17 17:32:00"  2                 2                  1.0
"2015-04-28 07:55:00"  2                 1                  0.5

Which county had the most accidents?

  • 👨‍👩‍👧‍👦 group and aggregate

Group and summarize/aggregate

group_by() + agg() or with_columns()

by_county = (accidents
             .group_by("county")
             .agg(pl.len().alias("accidents"),
                  pl.col("killed_or_injured").sum())
             .sort("accidents", descending=True)
)
by_county.head()
shape: (5, 3)
county                accidents  killed_or_injured
str                   u32        i64
"Harju maakond"       7000       8423
"Tartu maakond"       1591       1968
"Pärnu maakond"       1008       1293
"Ida-Viru maakond"    991        1348
"Lääne-Viru maakond"  645        883

What’s the per capita accident victim rate, by county?

(Optional)

Joining dataframes

county_pop = (pl.read_csv("data/county-pop.csv", skip_rows=2)
              .rename({"County": "county", "Age groups total": "population"})
              .select(["county", "population"])
              # this has "county" in the county names, not "maakond"
              .with_columns(pl.col("county").str.replace("county", "maakond"))
              )

by_county_w_pop = by_county.join(county_pop, on="county", how="left")
by_county_w_pop.head(3)
shape: (3, 4)
county           accidents  killed_or_injured  population
str              u32        i64                i64
"Harju maakond"  7000       8423               598059
"Tartu maakond"  1591       1968               152977
"Pärnu maakond"  1008       1293               85938

Joining dataframes

Now we can use a simple select + expressions to do the math:

by_county_w_pop.select(
    pl.col("county"), 
    pl.col("killed_or_injured"),
    pl.col("killed_or_injured").truediv(pl.col("population")).mul(1000).alias("rate/1000")
    ).head(3)
shape: (3, 3)
county           killed_or_injured  rate/1000
str              i64                f64
"Harju maakond"  8423               14.083895
"Tartu maakond"  1968               12.864679
"Pärnu maakond"  1293               15.045731

Who wears reflectors? Men or women? Young or old?

We will use new data for this.

  • 🧱 reshape wide, long

Reshaping / pivoting dataframes

reflectors = (pl.read_csv("data/reflectors.csv", has_header=True, separator=";", skip_rows=2)
              .filter(pl.col("Sex").ne("Men and women"))
              .drop(["Type of data", "Year", "All age groups (16-64)"])
              .sort("Reflector use", "Sex")
)
reflectors.head()
shape: (5, 7)
Reflector use                     Sex      16-24  25-34  35-44  45-54  55-64
str                               str      f64    f64    f64    f64    f64
"Nearly always"                   "Men"    34.3   40.5   52.2   58.6   55.9
"Nearly always"                   "Women"  58.4   64.9   71.4   78.4   74.9
"Never"                           "Men"    14.3   12.4   7.2    3.9    2.7
"Never"                           "Women"  8.8    5.0    4.6    2.0    2.5
"Never walk on dark streets, ro…  "Men"    4.8    10.8   9.7    11.3   12.8

Reshaping / pivoting dataframes

reflectors = (reflectors
              .unpivot(index=["Reflector use", "Sex"], 
                       variable_name="age_group", 
                       value_name="percentage")
            .sort("Reflector use", "Sex", "age_group")
)
reflectors.head()
shape: (5, 4)
Reflector use    Sex    age_group  percentage
str              str    str        f64
"Nearly always"  "Men"  "16-24"    34.3
"Nearly always"  "Men"  "25-34"    40.5
"Nearly always"  "Men"  "35-44"    52.2
"Nearly always"  "Men"  "45-54"    58.6
"Nearly always"  "Men"  "55-64"    55.9

Plot reflector use by age and gender

reflectors.get_column("Reflector use").unique().to_list()
['Never', 'Never walk on dark streets, roads', 'Sometimes', 'Nearly always']
(reflectors
 .filter(pl.col("Reflector use").eq("Never"))
 .plot.line(x = "age_group", y = "percentage", color = "Sex")
 .properties(width=700, height=300)
)

Modified plot

One category is “Never walk on dark streets, roads”…🧐

(reflectors
 .with_columns(pl.col("Reflector use").str.replace("Never walk on dark streets, roads", "Never"))
 .group_by(["Reflector use", "Sex", "age_group"])
 .agg(pl.col("percentage").sum())
 .filter(pl.col("Reflector use").eq("Never"))
 .sort(["age_group", "Sex"])
 .plot.line(x = "age_group", y = "percentage", color = "Sex")
 .properties(width=700, height=300)
)

Why you should plot your data 😼

df1 = pl.read_csv("data/dataset1.csv")
df1.shape
(142, 2)
stats = ["mean", "std", "25%", "75%"]
(df1
 .describe()
 .filter(pl.col("statistic").is_in(stats))
)
shape: (4, 3)
statistic  x          y
str        f64        f64
"mean"     54.2661    47.834721
"std"      16.769825  26.939743
"25%"      39.706326  24.46783
"75%"      69.359559  71.806616
df2 = pl.read_csv("data/dataset2.csv")
df2.shape
(142, 2)
stats = ["mean", "std", "25%", "75%"]
(df2
 .describe()
 .filter(pl.col("statistic").is_in(stats))
)
shape: (4, 3)
statistic  x          y
str        f64        f64
"mean"     54.263273  47.832253
"std"      16.765142  26.935403
"25%"      44.1026    25.2564
"75%"      64.8718    69.1026

Why you should plot pt2

df1.plot.point("x", "y")
df2.plot.point("x", "y")

🦖🦖🦖🦖🦖🦖🦖🦖🦖🦖🦖🦖🦖

The big picture

Andy is a polars stan

Comparison

pandas

  • ✅ Very widely used and supported
  • ✅ Stable
  • ❓ More imperative, traditional API
  • ❌ Inconsistent API, multiple ways of doing the same thing
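A small illustration of that last point (toy data): four equivalent, all-idiomatic ways to pull out the same column.

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3], "group": ["a", "b", "a"]})

# Four ways to get the same column -- all seen in the wild:
a = df["x"]          # bracket indexing
b = df.x             # attribute access
c = df.loc[:, "x"]   # label-based
d = df.iloc[:, 0]    # position-based
assert a.equals(b) and a.equals(c) and a.equals(d)
```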

polars

  • ✅ More consistent, functional-style API
  • ✅ Faster, less memory footprint
  • ✅ Works with larger-than-memory (out-of-core) datasets out of the box
  • ❌ API still changing

Other frameworks

Thank you!

Scan this and let us know how we did 🤗